Section 1 - Introduction to dplyr
Load the dplyr and hflights package
We will learn the ins and outs of working with dplyr. dplyr is an R package, a collection of functions and data sets that enhance the R language.
Throughout this course you will use dplyr to analyze a data set of airline flight data containing flights that departed from Houston. This data is stored in a package called hflights.
Both dplyr and hflights are already installed on DataCamp’s servers, so loading them with library() will get you up and running.
install.packages("hflights")# Load the dplyr package
library(dplyr)
# Load the hflights package
library(hflights)
# Call both head() and summary() on hflights
hflights %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| Year | Month | DayofMonth | DayOfWeek | DepTime | ArrTime | UniqueCarrier | FlightNum | TailNum | ActualElapsedTime | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5424 | 2011 | 1 | 1 | 6 | 1400 | 1500 | AA | 428 | N576AA | 60 | 40 | -10 | 0 | IAH | DFW | 224 | 7 | 13 | 0 | 0 | |
| 5425 | 2011 | 1 | 2 | 7 | 1401 | 1501 | AA | 428 | N557AA | 60 | 45 | -9 | 1 | IAH | DFW | 224 | 6 | 9 | 0 | 0 | |
| 5426 | 2011 | 1 | 3 | 1 | 1352 | 1502 | AA | 428 | N541AA | 70 | 48 | -8 | -8 | IAH | DFW | 224 | 5 | 17 | 0 | 0 | |
| 5427 | 2011 | 1 | 4 | 2 | 1403 | 1513 | AA | 428 | N403AA | 70 | 39 | 3 | 3 | IAH | DFW | 224 | 9 | 22 | 0 | 0 | |
| 5428 | 2011 | 1 | 5 | 3 | 1405 | 1507 | AA | 428 | N492AA | 62 | 44 | -3 | 5 | IAH | DFW | 224 | 9 | 9 | 0 | 0 | |
| 5429 | 2011 | 1 | 6 | 4 | 1359 | 1503 | AA | 428 | N262AA | 64 | 45 | -7 | -1 | IAH | DFW | 224 | 6 | 13 | 0 | 0 | |
| 5430 | 2011 | 1 | 7 | 5 | 1359 | 1509 | AA | 428 | N493AA | 70 | 43 | -1 | -1 | IAH | DFW | 224 | 12 | 15 | 0 | 0 | |
| 5431 | 2011 | 1 | 8 | 6 | 1355 | 1454 | AA | 428 | N477AA | 59 | 40 | -16 | -5 | IAH | DFW | 224 | 7 | 12 | 0 | 0 | |
| 5432 | 2011 | 1 | 9 | 7 | 1443 | 1554 | AA | 428 | N476AA | 71 | 41 | 44 | 43 | IAH | DFW | 224 | 8 | 22 | 0 | 0 | |
| 5433 | 2011 | 1 | 10 | 1 | 1443 | 1553 | AA | 428 | N504AA | 70 | 45 | 43 | 43 | IAH | DFW | 224 | 6 | 19 | 0 | 0 |
sum_hfl <- as.data.frame(do.call(cbind, lapply(hflights, summary)))## Warning in (function (..., deparse.level = 1) : number of rows of result is
## not a multiple of vector length (arg 1)
sum_hfl[,-c(1,2,3)] %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| DayOfWeek | DepTime | ArrTime | UniqueCarrier | FlightNum | TailNum | ActualElapsedTime | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. | 1 | 1 | 1 | 227496 | 1 | 227496 | 34 | 11 | -70 | -33 | 227496 | 227496 | 79 | 1 | 1 | 0 | 227496 | 0 |
| 1st Qu. | 2 | 1021 | 1215 | character | 855 | character | 77 | 58 | -8 | -3 | character | character | 376 | 4 | 10 | 0 | character | 0 |
| Median | 4 | 1416 | 1617 | character | 1696 | character | 128 | 107 | 0 | 0 | character | character | 809 | 5 | 14 | 0 | character | 0 |
| Mean | 3.94769138798045 | 1395.7556135375 | 1578.25410595731 | 227496 | 1961.66314132996 | 227496 | 129.323744606341 | 108.142334527457 | 7.09433431305109 | 9.44495104434283 | 227496 | 227496 | 787.783244540563 | 6.098854876799 | 15.0910981567498 | 0.0130683616415234 | 227496 | 0.00285279741182263 |
| 3rd Qu. | 6 | 1801 | 1953 | character | 2755 | character | 165 | 141 | 11 | 9 | character | character | 1042 | 7 | 18 | 0 | character | 0 |
| Max. | 7 | 2400 | 2400 | character | 7290 | character | 575 | 549 | 978 | 981 | character | character | 3904 | 165 | 163 | 1 | character | 1 |
| NA’s | 1 | 2905 | 3066 | 227496 | 1 | 227496 | 3622 | 3622 | 3622 | 2905 | 227496 | 227496 | 79 | 3066 | 2947 | 0 | 227496 | 0 |
Explore the data set
A data scientist must be familiar with his or her data. Experiment with the data set in the console and maybe try to generate some insightful plots. For your convenience, hflights is already loaded into the workspace.
How many observations and how many variables are contained in the hflights data set?
dim(hflights)## [1] 227496 21
The variables are stored in the columns, whereas the observations are stored in the rows of this data set. As you can see, this data set is relatively large. In fact, it’s about 25Mb in size. For your convenience, we will continue the course’s exercises with a random subset of the data set.
Section 2 - tbl, a special type of data.frame
Convert data.frame to tibble
A tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using as_tibble().
The tbl format changes how R displays your data, but it does not change the data’s underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame. In other words, you can do anything with the hflights tbl that you could do with the hflights data.frame.
# Both the dplyr and hflights packages are loaded
# Convert the hflights_df data.frame into a hflights tbl
hflights_df <- hflights[sample(nrow(hflights), 720), ]
hflights <- as_tibble(hflights_df)
# Display the hflights tbl
hflights## # A tibble: 720 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 3 11 5 1544 1741 OO 1170
## 2 2011 1 17 1 1252 1622 CO 1850
## 3 2011 2 4 5 1503 1554 WN 2477
## 4 2011 10 8 6 2058 2202 CO 1019
## 5 2011 5 10 2 1525 1901 US 1978
## 6 2011 2 25 5 1759 1858 XE 2656
## 7 2011 7 11 1 1910 2012 XE 2911
## 8 2011 4 2 6 1907 2007 CO 479
## 9 2011 11 4 5 1635 1812 XE 4277
## 10 2011 1 28 5 715 847 XE 2606
## # ... with 710 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
# Create the object carriers
hflights$carriers <- hflights$UniqueCarrierChanging labels of hflights, part 1 of 2
A bit of cleaning would be a good idea since the UniqueCarrier variable of hflights uses a confusing code system.
To do this, let’s work with a lookup table, that comes in the form of a named vector. When you subset the lookup table with a character string (like the character strings in UniqueCarrier), R will return the values of the lookup table that correspond to the names in the character string. To see how this works, run following code in the console:
two <- c("AA", "AS")
lut <- c("AA" = "American",
"AS" = "Alaska",
"B6" = "JetBlue")
two <- lut[two]
two## AA AS
## "American" "Alaska"
# Both the dplyr and hflights packages are loaded into workspace
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
# Add the Carrier column to hflights
hflights$Carrier <- lut[hflights$UniqueCarrier]
# Glimpse at hflights
glimpse(hflights)## Observations: 720
## Variables: 23
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month <int> 3, 1, 2, 10, 5, 2, 7, 4, 11, 1, 4, 7, 6, 2, ...
## $ DayofMonth <int> 11, 17, 4, 8, 10, 25, 11, 2, 4, 28, 28, 23, ...
## $ DayOfWeek <int> 5, 1, 5, 6, 2, 5, 1, 6, 5, 5, 4, 6, 4, 1, 4,...
## $ DepTime <int> 1544, 1252, 1503, 2058, 1525, 1759, 1910, 19...
## $ ArrTime <int> 1741, 1622, 1554, 2202, 1901, 1858, 2012, 20...
## $ UniqueCarrier <chr> "OO", "CO", "WN", "CO", "US", "XE", "XE", "C...
## $ FlightNum <int> 1170, 1850, 2477, 1019, 1978, 2656, 2911, 47...
## $ TailNum <chr> "N742SK", "N32626", "N252WN", "N37409", "N53...
## $ ActualElapsedTime <int> 117, 150, 51, 124, 156, 119, 62, 60, 97, 92,...
## $ AirTime <int> 94, 128, 38, 108, 129, 97, 41, 40, 77, 72, 5...
## $ ArrDelay <int> 15, -10, 44, -18, 3, -10, -10, 1, 10, -6, -6...
## $ DepDelay <int> 14, -3, 48, -2, -5, -6, -5, -3, 10, -5, -2, ...
## $ Origin <chr> "IAH", "IAH", "HOU", "IAH", "IAH", "IAH", "I...
## $ Dest <chr> "MCI", "RDU", "SAT", "DEN", "CLT", "ELP", "M...
## $ Distance <int> 643, 1043, 192, 862, 913, 667, 262, 191, 562...
## $ TaxiIn <int> 5, 3, 4, 7, 10, 3, 3, 2, 10, 8, 5, 7, 6, 2, ...
## $ TaxiOut <int> 18, 19, 9, 9, 17, 19, 18, 18, 10, 12, 11, 6,...
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ carriers <chr> "OO", "CO", "WN", "CO", "US", "XE", "XE", "C...
## $ Carrier <chr> "SkyWest", "Continental", "Southwest", "Cont...
Changing labels of hflights, part 2 of 2
Let’s try a similar thing, but this time to change the labels in the CancellationCode column. This column lists reasons why a flight was cancelled using a non-informative alphabetical code. Execute
unique(hflights$CancellationCode)
A lookup table lut has already been created for you, that converts the alphabetical codes into more meaningful strings.
# The hflights tbl you built in the previous exercise is available in the workspace.
# The lookup table
lut <- c("A" = "carrier", "B" = "weather", "C" = "FFA", "D" = "security", "E" = "not cancelled")
# Add the Code column
hflights$Code <- lut[hflights$CancellationCode]
# Glimpse at hflights
glimpse(hflights)## Observations: 720
## Variables: 24
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month <int> 3, 1, 2, 10, 5, 2, 7, 4, 11, 1, 4, 7, 6, 2, ...
## $ DayofMonth <int> 11, 17, 4, 8, 10, 25, 11, 2, 4, 28, 28, 23, ...
## $ DayOfWeek <int> 5, 1, 5, 6, 2, 5, 1, 6, 5, 5, 4, 6, 4, 1, 4,...
## $ DepTime <int> 1544, 1252, 1503, 2058, 1525, 1759, 1910, 19...
## $ ArrTime <int> 1741, 1622, 1554, 2202, 1901, 1858, 2012, 20...
## $ UniqueCarrier <chr> "OO", "CO", "WN", "CO", "US", "XE", "XE", "C...
## $ FlightNum <int> 1170, 1850, 2477, 1019, 1978, 2656, 2911, 47...
## $ TailNum <chr> "N742SK", "N32626", "N252WN", "N37409", "N53...
## $ ActualElapsedTime <int> 117, 150, 51, 124, 156, 119, 62, 60, 97, 92,...
## $ AirTime <int> 94, 128, 38, 108, 129, 97, 41, 40, 77, 72, 5...
## $ ArrDelay <int> 15, -10, 44, -18, 3, -10, -10, 1, 10, -6, -6...
## $ DepDelay <int> 14, -3, 48, -2, -5, -6, -5, -3, 10, -5, -2, ...
## $ Origin <chr> "IAH", "IAH", "HOU", "IAH", "IAH", "IAH", "I...
## $ Dest <chr> "MCI", "RDU", "SAT", "DEN", "CLT", "ELP", "M...
## $ Distance <int> 643, 1043, 192, 862, 913, 667, 262, 191, 562...
## $ TaxiIn <int> 5, 3, 4, 7, 10, 3, 3, 2, 10, 8, 5, 7, 6, 2, ...
## $ TaxiOut <int> 18, 19, 9, 9, 17, 19, 18, 18, 10, 12, 11, 6,...
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ carriers <chr> "OO", "CO", "WN", "CO", "US", "XE", "XE", "C...
## $ Carrier <chr> "SkyWest", "Continental", "Southwest", "Cont...
## $ Code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
Session info
sessionInfo()## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Switzerland.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] hflights_0.1 ggplot2_3.1.0 dplyr_0.8.0.1 gapminder_0.3.0
## [5] kableExtra_1.0.1 knitr_1.21
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 highr_0.7 plyr_1.8.4
## [4] pillar_1.3.1 compiler_3.5.2 prettydoc_0.2.1
## [7] tools_3.5.2 digest_0.6.18 gtable_0.2.0
## [10] evaluate_0.12 tibble_2.0.1 viridisLite_0.3.0
## [13] pkgconfig_2.0.2 rlang_0.3.1 cli_1.0.1
## [16] rstudioapi_0.9.0 yaml_2.2.0 xfun_0.4
## [19] withr_2.1.2 httr_1.4.0 stringr_1.4.0
## [22] xml2_1.2.0 hms_0.4.2 webshot_0.5.1
## [25] grid_3.5.2 tidyselect_0.2.5 glue_1.3.0
## [28] R6_2.4.0 fansi_0.4.0 rmarkdown_1.11
## [31] readr_1.3.1 purrr_0.3.0 magrittr_1.5
## [34] scales_1.0.0 htmltools_0.3.6 assertthat_0.2.0
## [37] rvest_0.3.2 colorspace_1.4-0 utf8_1.1.4
## [40] stringi_1.3.1 lazyeval_0.2.1 munsell_0.5.0
## [43] crayon_1.3.4